Abstract

We present a new latent-variable model employing a Gaussian mixture
integrated with a feature selection procedure (the Bernoulli part of
the model), which together form a "Latent Bernoulli-Gauss"
distribution. The model is applied to MAP estimation, clustering,
feature selection and collaborative filtering, and compares favorably
with state-of-the-art latent-variable models.



1 Introduction 

We present a new mixture model for collections of discrete data with
applications to clustering through MAP classification, supervised
learning, feature selection and collaborative filtering. In the
language of text modeling, the algorithm integrates modeling of word
frequencies with a feature selection procedure into a single latent
class distribution model. The algorithm defines two types of words:
(i) keywords, representing "important" words associated with
high-frequency appearance, and (ii) all remaining words (not including
stop-words, which are omitted from consideration). All keywords are
"topic specific" and are modeled by a mixture of Gaussians (one per
topic), whereas all remaining words are considered "topic unspecific"
and are modeled by a single Gaussian. The decision of which words are
the keywords of a document is modeled by a latent Bernoulli process;
thus together we have a "Latent Bernoulli-Gauss" (LBG) model.

We present the LBG model in sec. 2 and its applications in sec. 2.3.
In sec. 3 we present a detailed discussion of the merits of LBG as
compared to existing latent-variable models including
Mixture-of-Unigrams (MOU) (Nigam et al., 2000), probabilistic Latent
Semantic Indexing (pLSI) (Hofmann, 1999) and Latent Dirichlet
Allocation (LDA) (Blei et al., 2003). We conducted a series of
experiments on public datasets covering a spectrum of information
retrieval applications; a detailed discussion of experimental results
and comparisons to MOU, LDA and pLSI is in sec. 4.

We use the language of text collections throughout the 
paper, referring to measurements as "word frequen- 
cies" and "documents". Nevertheless, the LBG model 
is general and can be applied (and is applied in sec. 4) 
to a wide range of data analysis tasks. 

2 The Bernoulli-Gauss Mixture Model 

Consider a code-book of size n representing the vocabulary of n words
in a dictionary. A document is an unordered collection of N words
$w_1, \ldots, w_N$ where $w_i \in \{1, \ldots, n\}$. A document d is
represented by the n frequencies of word appearances, normalized in a
proper manner (in text applications we use the
term-frequency-inverse-document-frequency (tf-idf) normalization),
resulting in $d = (m_1, \ldots, m_n)$, a set of non-negative real
numbers.

For a document d, we distinguish between a "keyword", which is
associated with a high frequency, and the other low-frequency words of
the document. A keyword is another way of saying that the word is
"important" for that document. Let $x \in \{0, 1\}^n$ be an indicator
set where $x_i = 1$ if the i'th word in the code-book is a keyword and
$x_i = 0$ otherwise. We assume that the keywords are modeled by a
topic-specific Normal distribution whereas all other words are modeled
by a topic-unspecific Normal distribution. Let $y \in \{1, \ldots, k\}$
be a random variable representing the k possible "topics" which
generated the document d. Let $p_{si}$ be the probability of the i'th
code-word to be a keyword in topic $s = 1, \ldots, k$. The Latent
Bernoulli-Gauss model $\Pr(d \mid y = s, \theta)$ of document d given
topic y = s is:

$$\prod_{i=1}^{n} \left( p_{si}\, N(m_i; c_{si}, \sigma_{si}^2) \right)^{x_i} \left( (1 - p_{si})\, N(m_i; c_i, \sigma_i^2) \right)^{1 - x_i} \qquad (1)$$

where $N(z; c, \sigma^2)$ is the Normal distribution
$\mathcal{N}(c, \sigma^2)$ evaluated at z, and $\theta = (p, c, \sigma)$
holds the parameters of the model. If the i'th code word is a keyword
($x_i = 1$) then the word's frequency $m_i$ is governed by a
topic-specific Gaussian distribution $N(c_{si}, \sigma_{si}^2)$;
otherwise $m_i \sim N(c_i, \sigma_i^2)$, a topic-unspecific Gaussian
distribution which we refer to as a "cross Gaussian". The probability
$\Pr(d \mid \theta)$ of a document d to be generated by the LBG model
is found by the mixture:

$$\Pr(d \mid \theta) = \sum_{s=1}^{k} \Pr(d \mid y = s, \theta)\, \Pr(y = s \mid \theta) = \sum_{s=1}^{k} \lambda_s \Pr(d \mid y = s, \theta),$$

where $\sum_s \lambda_s = 1$. Given a training set of documents
$\mathcal{D} = (d_1, \ldots, d_m)$ we wish to fit the model parameters
$\theta, \lambda$ and select the important code words for each
document, i.e., estimate $X = (x_1, \ldots, x_m)$ where
$x_j \in \{0, 1\}^n$ is the keyword indicator set associated with
$d_j$. We alternate between two procedures: (i) Maximum-Likelihood (ML)
estimation of $\{\theta, \lambda\}$ given X, and (ii) a procedure for
estimating X given $\{\theta, \lambda\}$.
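As an illustration of eqn. 1 and the mixture above, the following is a minimal sketch in plain Python, not the authors' implementation. The function names and the list-based layout (`p[s][i]`, `c[s][i]`, `var[s][i]` for the topic-specific parameters, `c_cross[i]`, `var_cross[i]` for the cross Gaussians) are assumptions made for the example; all probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
import math

def log_normal(z, c, var):
    """Log-density of the Normal distribution N(c, var) evaluated at z."""
    return -0.5 * (math.log(2 * math.pi * var) + (z - c) ** 2 / var)

def log_lik_given_topic(m, x, p_s, c_s, var_s, c_cross, var_cross):
    """log Pr(d | y = s, theta) for one document, term by term as in eqn. 1."""
    total = 0.0
    for i in range(len(m)):
        if x[i] == 1:   # keyword: topic-specific Gaussian
            total += math.log(p_s[i]) + log_normal(m[i], c_s[i], var_s[i])
        else:           # non-keyword: topic-unspecific "cross Gaussian"
            total += math.log(1 - p_s[i]) + log_normal(m[i], c_cross[i], var_cross[i])
    return total

def log_mixture(m, x, lam, p, c, var, c_cross, var_cross):
    """log Pr(d | theta) = log sum_s lam_s Pr(d | y = s, theta), via log-sum-exp."""
    terms = [math.log(lam[s])
             + log_lik_given_topic(m, x, p[s], c[s], var[s], c_cross, var_cross)
             for s in range(len(lam))]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

Working in log-space is a design choice of the sketch: the product over n code words underflows quickly for realistic vocabulary sizes.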

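The ML estimation of $\{\theta, \lambda\}$ given an i.i.d. training
set $\{\mathcal{D}, X\}$ takes the form:

$$\max_{\theta, \lambda} \sum_{j=1}^{m} \log \Pr(d_j \mid \theta, x_j) = \max_{\theta, \lambda} \sum_{j=1}^{m} \log \sum_{s=1}^{k} \lambda_s \Pr(d_j \mid y_j = s, \theta, x_j),$$

where $\Pr(d_j \mid y_j = s, \theta, x_j)$ is given by:

$$\prod_{i=1}^{n} \left( p_{si}\, N(m_{ji}; c_{si}, \sigma_{si}^2) \right)^{x_{ji}} \left( (1 - p_{si})\, N(m_{ji}; c_i, \sigma_i^2) \right)^{1 - x_{ji}}.$$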

Using the Expectation-Maximization (EM) iterative update (Dempster et
al., 1977), the following auxiliary function is optimized during the
M-step:

$$\sum_{j=1}^{m} \sum_{s=1}^{k} \mu_{sj}^{(t)} \log \left( \lambda_s \Pr(d_j \mid y_j = s, \theta, x_j) \right),$$

where $\mu_{sj}^{(t)} = \Pr(y_j = s \mid d_j, x_j, \theta^{(t)})$ is the
posterior probability given the parameters at iteration (t).
Optimizing the auxiliary function at step (t) introduces an update
rule for $\theta, \lambda$:



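$$\lambda_s^{(t)} = \frac{1}{m} \sum_{j=1}^{m} \mu_{sj}^{(t)}, \qquad p_{si}^{(t)} = \frac{1}{\sum_j \mu_{sj}^{(t)}} \sum_{j=1}^{m} \mu_{sj}^{(t)} x_{ji} \qquad (2)$$

$$c_{si}^{(t)} = \frac{1}{\sum_j \mu_{sj}^{(t)} x_{ji}} \sum_{j=1}^{m} \mu_{sj}^{(t)} x_{ji}\, m_{ji} \qquad (3)$$

$$\left(\sigma_{si}^{2}\right)^{(t)} = \frac{1}{\sum_j \mu_{sj}^{(t)} x_{ji}} \sum_{j=1}^{m} \mu_{sj}^{(t)} x_{ji} \left( m_{ji} - c_{si}^{(t)} \right)^2 \qquad (4)$$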
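The M-step can be sketched as follows, assuming Eqns. 2-4 take the usual posterior-weighted maximum-likelihood form obtained by differentiating the auxiliary function (posterior-weighted keyword fraction, mean and variance). The function name `m_step`, the nested-list layout and the fallback used when a word is never marked as a keyword are assumptions of the sketch, not part of the original algorithm.

```python
def m_step(mu, X, M):
    """One M-step under the assumed weighted-ML form of Eqns. 2-4.

    mu[s][j]: posterior of topic s for document j
    X[j][i]:  keyword indicator of code word i in document j
    M[j][i]:  (tf-idf) frequency of code word i in document j
    Assumes every topic has positive posterior mass.
    """
    k, m, n = len(mu), len(X), len(X[0])
    lam, p, c, var = [], [], [], []
    for s in range(k):
        mass = sum(mu[s])                 # sum_j mu_sj
        lam.append(mass / m)              # mixing weight lambda_s
        p_s, c_s, v_s = [], [], []
        for i in range(n):
            # posterior-weighted number of documents where word i is a keyword
            w = sum(mu[s][j] * X[j][i] for j in range(m))
            p_s.append(w / mass)
            if w > 0:
                mean = sum(mu[s][j] * X[j][i] * M[j][i] for j in range(m)) / w
                v = sum(mu[s][j] * X[j][i] * (M[j][i] - mean) ** 2
                        for j in range(m)) / w
            else:
                mean, v = 0.0, 1.0        # assumption: arbitrary fallback
            c_s.append(mean)
            v_s.append(v)
        p.append(p_s)
        c.append(c_s)
        var.append(v_s)
    return lam, p, c, var
```

In practice a small variance floor would probably be needed, since a word that is a keyword in a single document yields a zero variance estimate.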



The parameters $c_i, \sigma_i^2$ of the cross-Gaussians are estimated
directly from $\mathcal{D}, X$ since they do not depend on the choice
of topics. The posteriors are updated during the E-step via
application of the Bayes rule:

$$\mu_{sj}^{(t+1)} \propto \lambda_s^{(t)} \Pr(d_j \mid y_j = s, x_j, \theta^{(t)}), \qquad (5)$$

where $\propto$ stands for equality up to normalization, i.e.,
$\sum_s \mu_{sj}^{(t+1)} = 1$.
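The E-step of eqn. 5 amounts to normalizing, per document, the products of mixing weight and likelihood. A minimal sketch, assuming the per-topic log-likelihoods have already been computed and are passed in as `log_lik[s][j]` (a hypothetical layout chosen for the example):

```python
import math

def e_step(log_lam, log_lik):
    """Posterior update of eqn. 5, computed in log-space for stability.

    log_lam[s]:    log of the mixing weight lambda_s
    log_lik[s][j]: log Pr(d_j | y_j = s, x_j, theta)
    Returns mu[s][j], normalized so that sum_s mu[s][j] == 1.
    """
    k, m = len(log_lik), len(log_lik[0])
    mu = [[0.0] * m for _ in range(k)]
    for j in range(m):
        scores = [log_lam[s] + log_lik[s][j] for s in range(k)]
        top = max(scores)                         # subtract the max before exp
        weights = [math.exp(v - top) for v in scores]
        z = sum(weights)
        for s in range(k):
            mu[s][j] = weights[s] / z
    return mu
```

Because the log-likelihood is a sum over all n code words, differences between topics are typically large, which is one way to see why the posteriors tend towards {0, 1} values.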

The estimation of X given the data $\mathcal{D}$ and the current
estimation of parameters $\{\theta, \lambda\}$ is based on the
following analysis. Consider natural numbers $q_s \in \mathbb{N}$,
$s = 1, \ldots, k$, representing the number of important code words
associated with topic s. The expected number of important code words
$g_j$ in document $d_j$ is given below:

$$g_j = \sum_{i=1}^{n} x_{ji} = \sum_{s=1}^{k} \mu_{sj}\, q_s. \qquad (6)$$



In other words, the indicator set X is fully determined by
$q_1, \ldots, q_k$ and the posteriors $\mu_{sj}$ (which are estimated
during the EM step above). The indicator $x_j$ for document $d_j$, for
instance, is defined by the top $g_j$ highest-frequency code words.
Our task, therefore, is to derive a procedure for estimating
$q_1, \ldots, q_k$ given the parameters $\theta, \lambda$ and $\mu$
estimated during the EM steps.
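The way the indicator of a document follows from the per-topic counts and the posteriors can be sketched as below. The function name and the rounding of the expected count to an integer are assumptions of the example; the source does not say how a fractional count is handled.

```python
def keyword_indicator(m_j, mu_j, q):
    """Indicator set of one document from eqn. 6 and the top-frequency rule.

    m_j:  frequencies of the n code words in the document
    mu_j: normalized topic posteriors of the document (sum to 1)
    q:    number of important code words per topic
    """
    expected = sum(mu_s * q_s for mu_s, q_s in zip(mu_j, q))   # eqn. 6
    count = int(round(expected))       # assumption: round to an integer
    # rank code words by decreasing frequency and mark the top `count`
    order = sorted(range(len(m_j)), key=lambda i: m_j[i], reverse=True)
    x = [0] * len(m_j)
    for i in order[:count]:
        x[i] = 1
    return x
```

For instance, with frequencies `[0.1, 3.0, 0.5, 2.0]`, posteriors `[1.0, 0.0]` and counts `[2, 3]`, the expected count is 2 and the two highest-frequency words (indices 1 and 3) are marked.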

We will begin by establishing an algebraic constraint between
$q = (q_1, \ldots, q_k)$ and the parameters $p, \mu$:

Claim 1 Let $b = (b_1, \ldots, b_k)$ be defined by
$b_s = (\sum_j \mu_{sj})(\sum_i p_{si})$ for $s = 1, \ldots, k$, and
let U be a $k \times m$ matrix holding the posteriors,
$U_{sj} = \mu_{sj}$. Then,

$$b = U U^\top q. \qquad (7)$$



Proof: consider the formula representing the expected number of
important words for a document of topic s:

$$E_s = \frac{1}{\sum_j \mu_{sj}} \sum_{j} \mu_{sj}\, g_j.$$

On the other hand, clearly, $E_s = \sum_i p_{si}$ since $p_{si}$ is
the probability that the i'th code word is important for documents of
topic s. Substituting the definition of $g_j$ from eqn. 6, we obtain:

$$\Big( \sum_j \mu_{sj} \Big) \Big( \sum_i p_{si} \Big) = \sum_{r=1}^{k} q_r \sum_{j=1}^{m} \mu_{sj}\, \mu_{rj},$$

where the right-hand side is the s'th coordinate of $U U^\top q$. □

The conditional-independence assumption $w_i \perp w_j \mid y$
(Naive Bayes) creates "over-confident" posteriors, i.e.,
$\mu_{sj} \to \{0, 1\}$, a well-known by-product (or side-effect) of
the Naive Bayes assumption (see (Domingos & Pazzani, 1997) for a
discussion). As a result, the constraint $U U^\top q = b$ is
simplified considerably:
$U U^\top \approx \mathrm{diag}(\delta_1, \ldots, \delta_k)$, where
$\delta_s = \sum_j \mu_{sj}^2 \approx \sum_j \mu_{sj}$. Eqn. 7,
therefore, reduces to:

$$q_s = \sum_{i=1}^{n} p_{si}, \qquad (8)$$



for $s = 1, \ldots, k$. Eqn. 8 is not an effective update rule for
setting $q_1, \ldots, q_k$ because (i) there is no built-in drive to
generate a sparse p, and as a result a large number of small-valued
entries in p will inflate the value of $q_s$, and (ii) once the
entries of $\mu$ settle on {0, 1} values, the indicator set X will
remain fixed.
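The simplification of the constraint for over-confident posteriors can be checked numerically. The sketch below computes the s'th coordinate of U Uᵀq exactly and under the diagonal approximation; the function names are illustrative only.

```python
def uut_times_q(U, q):
    """Exact coordinates of U U^T q for a k x m posterior matrix U."""
    k, m = len(U), len(U[0])
    # g[j] = sum_r U[r][j] * q[r] is the expected keyword count of eqn. 6
    g = [sum(U[r][j] * q[r] for r in range(k)) for j in range(m)]
    return [sum(U[s][j] * g[j] for j in range(m)) for s in range(k)]

def diagonal_approx(U, q):
    """Coordinates under U U^T ~ diag(delta), with delta_s = sum_j mu_sj."""
    return [sum(row) * q_s for row, q_s in zip(U, q)]
```

With hard posteriors such as `U = [[1, 1, 0], [0, 0, 1]]` and `q = [3, 5]` both functions return `[6, 5]`; with soft posteriors the two differ, which is the regime in which the reduced constraint is only an approximation.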

A more effective use of Eqn. 8 is to set $q_s$ as the number of top
entries in p:

$$q_s^{(t)} = \left| \{ i : p_{si} > \tau_s^{(t)} \} \right|,$$

for some iteration-dependent threshold $\tau_s$. In the following
section we use a similar analysis to derive the value of $\tau_s$,
which will conclude the Bernoulli-Gauss mixture algorithm.



2.1 Update Rule for $q_1, \ldots, q_k$



Let $q_s^*$ be the (unknown) ground truth value for $q_s$. Since $g_j$
(eqn. 6) is the number of keywords in document $d_j$, the probability
that a keyword will be selected in $d_j$, conditioned on topic s, is
$\min\{g_j / q_s^*, 1\}$. The event that a keyword is selected in
$d_j$ and the topic is s is a Bernoulli random variable with
probability of "success" $\mu_{sj} \min\{g_j / q_s^*, 1\}$. The
expected number of times a keyword is selected over the corpus of m
documents of topic s is the sum of expectations of m Bernoulli trials:

$$\sum_{j=1}^{m} \mu_{sj} \min\{ g_j / q_s^*, 1 \}.$$



On the other hand, the expected number of times the i'th code-word
(not necessarily a keyword) is selected in documents of topic s is
$m \lambda_s p_{si}$. As a result, for the i'th code-word to be a
keyword for a document of topic s the following condition must be
satisfied:

$$m \lambda_s p_{si} \ge \sum_{j=1}^{m} \mu_{sj} \min\{ g_j / q_s^*, 1 \} \ge \frac{1}{n} \sum_{j=1}^{m} \mu_{sj}\, g_j,$$



where the first inequality is due to the right-hand side being a lower
bound for a word to be a keyword, and the latter inequality is due to
$1 \le q_s^* \le n$. After rearranging terms and substituting eqn. 6
for $g_j$ we obtain:

$$p_{si}^{(t+1)} \ge \frac{1}{n \sum_j \mu_{sj}} \sum_{j=1}^{m} \sum_{r=1}^{k} \mu_{sj}\, \mu_{rj}\, q_r^{(t)}.$$



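Note that the right-hand side is the s'th coordinate of $U U^\top q$
scaled by $1 / (n \sum_j \mu_{sj})$. Given that the posteriors
$\mu_{sj}$ approach {0, 1} values, the condition above reduces to:

$$q_s^{(t+1)} = \left| \left\{ i : p_{si}^{(t+1)} \ge \frac{q_s^{(t)}}{n} \right\} \right|. \qquad (9)$$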



To conclude, the Bernoulli-Gauss mixture algorithm is summarized in
Alg. 1. The stopping criterion is that Claim 1 is satisfied, but in
practice it is sufficient to satisfy its reduced form, eqn. 9.
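The update of the per-topic keyword counts can be sketched as a simple thresholded count. The sketch assumes that eqn. 9 counts the code words whose keyword probability reaches the threshold q_s/n, which is how the reduced condition above reads once the posteriors are hard; the function name is illustrative.

```python
def update_q(p, q_prev):
    """Assumed form of eqn. 9: count the entries of p_s that reach q_s / n.

    p[s][i]:   current probability that code word i is a keyword in topic s
    q_prev[s]: current number of keywords of topic s
    """
    n = len(p[0])
    return [sum(1 for p_si in p[s] if p_si >= q_prev[s] / n)
            for s in range(len(p))]
```

For example, with `p = [[0.9, 0.05, 0.6, 0.3], [0.1, 0.1, 0.8, 0.7]]` and `q_prev = [2, 1]`, the thresholds are 0.5 and 0.25 and the updated counts are `[2, 2]`.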

2.2 Evaluating the Model on Novel 
Documents 

Given a new document $d = (m_1, \ldots, m_n)$, where $m_i$ is the
frequency (tf-idf) of the i'th code-word in the document, we wish to
evaluate the probability $\Pr(d)$ of d to arise from the model, and
the posteriors $\mu_s(d) = \Pr(y = s \mid d)$ which provide
classification (topic assignment) information. A necessary ingredient
in those calculations is the estimation of the keyword indicator set
$x \in \{0, 1\}^n$ for document d. To estimate x associated with the
novel document d we perform the following steps:

1. For $s = 1, \ldots, k$: (i) define $x_s$ as the indicator set
defined by the top $q_s$ code words in d, (ii) compute
$\mu_s \propto \lambda_s \Pr(d \mid y = s, x_s)$ where
$\Pr(d \mid y = s, x_s)$ is defined in eqn. 1.

2. Set x as the top $(1 / \sum_s \mu_s) \sum_s \mu_s q_s$ code words
in d.

Once x is estimated, one can readily compute the posteriors
$\mu_s \propto \lambda_s \Pr(d \mid y = s, x, \theta)$,
$s = 1, \ldots, k$, and $\Pr(d)$ from:
$\Pr(d) = \sum_{s=1}^{k} \lambda_s \Pr(d \mid y = s, x, \theta)$.
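The two steps above, followed by the final posterior computation, can be sketched as follows. This is an illustrative reading rather than the authors' code: the helper names, the list layout of the parameters, and the rounding of the expected keyword count are assumptions, and all keyword probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def log_normal(z, c, var):
    return -0.5 * (math.log(2 * math.pi * var) + (z - c) ** 2 / var)

def log_lik(m, x, p_s, c_s, var_s, c0, var0):
    """log Pr(d | y = s, x) as in eqn. 1; c0/var0 are the cross Gaussians."""
    return sum(
        math.log(p_s[i]) + log_normal(m[i], c_s[i], var_s[i]) if x[i] == 1
        else math.log(1 - p_s[i]) + log_normal(m[i], c0[i], var0[i])
        for i in range(len(m)))

def top_indicator(m, count):
    """Mark the `count` highest-frequency code words of the document."""
    order = sorted(range(len(m)), key=lambda i: m[i], reverse=True)
    x = [0] * len(m)
    for i in order[:count]:
        x[i] = 1
    return x

def evaluate_novel(m, lam, p, c, var, c0, var0, q):
    k = len(lam)
    # Step 1: per-topic indicator x_s and unnormalized posterior mu_s
    scores = [math.log(lam[s])
              + log_lik(m, top_indicator(m, q[s]), p[s], c[s], var[s], c0, var0)
              for s in range(k)]
    top = max(scores)
    w = [math.exp(v - top) for v in scores]
    # Step 2: posterior-weighted keyword count (rounded: an assumption)
    count = int(round(sum(w_s * q_s for w_s, q_s in zip(w, q)) / sum(w)))
    x = top_indicator(m, count)
    # Final posteriors and log Pr(d) given the estimated x
    final = [math.log(lam[s]) + log_lik(m, x, p[s], c[s], var[s], c0, var0)
             for s in range(k)]
    top = max(final)
    w = [math.exp(v - top) for v in final]
    z = sum(w)
    return x, [v / z for v in w], top + math.log(z)
```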

2.3 Applications of the Model 

The Bernoulli-Gauss mixture model can be used in a 
number of ways and for different data analysis appli- 
cations, as described below: 

Clustering: given documents $d_1, \ldots, d_m$, cluster them into k
classes. Moreover, given a novel document d, determine its class
association. The posteriors $\mu_{sj}$ for $d_j$ and class s provide
the class assignment of document $d_j$. Since posteriors are
"over-confident" due to the Naive Bayes assumption, the assignment is
"hard" in practice. For a new document d, the posteriors $\mu_s$ (see
Sec. 2.2) provide the class assignments for $s = 1, \ldots, k$.

Supervised Inference: given a training set of documents with class
labels in the set $\{1, \ldots, h\}$ we wish to determine the class
membership of a given novel document. A possible approach is to
estimate an LBG model separately for each class, producing the model
parameters $\lambda_l, \theta_l, q_l$, $l = 1, \ldots, h$, and then
choose the class with the highest probability:
$\arg\max_l \Pr(d \mid \lambda_l, \theta_l, q_l)$.

Algorithm 1 Bernoulli-Gauss Mixture

Input: Given a training set of documents
$\mathcal{D} = (d_1, \ldots, d_m)$ we wish to fit the model parameters
$\lambda$ and $\theta = (p, c, \sigma)$ for k topics and the natural
numbers $q_1, \ldots, q_k$ of top-ranking (by tf-idf) words per topic.

Initialization: Set initial values $\lambda^{(0)}$, $\theta^{(0)}$,
$q^{(0)}$. Set the indicators $X^{(0)}$ from $\mathcal{D}$ and
$q^{(0)}$, i.e., $x_{ji}^{(0)} = 1$ if the tf-idf value $m_{ji}$ is
among the top $(1/k) \sum_s q_s^{(0)}$ entries in $d_j$. Set $t = 0$.

repeat
    $t \leftarrow t + 1$
    Update the posteriors $\mu_{sj}^{(t)}$ according to Eqn. 5 for
    $j = 1, \ldots, m$ and $s = 1, \ldots, k$.
    Update $\lambda^{(t)}$, $p^{(t)}$, $c^{(t)}$, $\sigma^{(t)}$ using
    Eqns. 2-4 and then update the cross-Gaussians.
    Set $q^{(t)}$ using eqn. 9.
    Set $X^{(t)}$: $x_{ji}^{(t)} = 1$ if the tf-idf value $m_{ji}$ is
    among the top $\sum_s \mu_{sj}^{(t)} q_s^{(t)}$ entries in $d_j$,
    for $i = 1, \ldots, n$ and $j = 1, \ldots, m$.
until $\sum_s | q_s^{(t)} - \sum_i p_{si}^{(t)} | < \epsilon$

Feature Selection: we can use the Bernoulli-Gauss mixture model for
selecting features. The selection criterion is based on $p_{si}$,
which is the probability that the i'th code word (feature) is a
keyword for topic s. We "de-select" a feature i if $p_{si} < \delta$
for some threshold $\delta$ for all $s = 1, \ldots, k$, i.e., a
feature that is not a keyword in any of the topics is removed from the
set of selected features. In Sec. 4 we apply the feature selection
scheme above as a filter for Support-Vector-Machine (SVM)
classification and for K-means clustering.
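The de-selection rule can be written in a few lines. The function name and the threshold value used in the example are illustrative assumptions.

```python
def select_features(p, delta):
    """Keep feature i unless p[s][i] < delta holds for every topic s.

    p[s][i]: probability that code word i is a keyword for topic s
    Returns the indices of the retained features.
    """
    n = len(p[0])
    return [i for i in range(n) if any(p_s[i] >= delta for p_s in p)]
```

With `p = [[0.9, 0.01, 0.2], [0.05, 0.02, 0.6]]` and `delta = 0.1`, feature 1 falls below the threshold in both topics and is dropped, leaving features 0 and 2 as the input to a downstream classifier or clustering step.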

Collaborative Filtering: there are applications where the indicator
set $x \in \{0, 1\}^n$ is known, and moreover when $x_i = 0$ the
frequency of the i'th code word $m_i$ is unknown. Collaborative
Filtering (CF) is an example of this class of applications, where
$d = (m_1, \ldots, m_n)$ is a list of discrete movie ratings with
$m_i \in \{1, \ldots, 5\}$ (stars) of an individual. Each individual
rates some of the movies, thus $x_i = 1$ for movies being rated and
$x_i = 0$ otherwise. Given a subset of ratings made by a new
individual, the task of CF is to predict movie ratings which were not
part of the original subset.

In this case, the cross-Gaussians are dropped from the model, i.e.,

$$\Pr(d \mid y = s) = \prod_{i=1}^{n} \left( p_{si}\, N(m_i; c_{si}, \sigma_{si}^2) \right)^{x_i} (1 - p_{si})^{1 - x_i}. \qquad (10)$$
From the training ratings $\{d_j, x_j\}$, $j = 1, \ldots, m$, we
estimate the model parameters $\theta, \lambda$ using Eqns. 2-4 (there
is no need to estimate $q_1, \ldots, q_k$ since the indicator sets are
known). We are given a new rating $\{d, x\}$ where
$d = (m_1, \ldots, m_n)$ and $x_i = 1$ when $m_i > 0$. Let
$i \in \{1, \ldots, n\}$ be a movie whose rating by the individual d
we wish to predict. Similarly to the "Forced Prediction" protocol
(Breese et al., 1998), we wish to estimate the probability
$\Pr(m_i = t \mid d, \theta)$ for $t = 1, \ldots, 5$. We start by
setting $x_i = 1$ (originally it was zero):

$$\Pr(m_i = t \mid d) = \sum_{s=1}^{k} \Pr(m_i = t \mid y = s)\, \Pr(y = s \mid d),$$

where

$$\Pr(m_i = t \mid y = s) = N(t; c_{si}, \sigma_{si}^2),$$

and the posterior $\Pr(y = s \mid d) \propto \lambda_s \Pr(d \mid y = s)$
is estimated through eqn. 10. The movie rating prediction $t^*$ is
found by: $t^* = \arg\max_t \Pr(m_i = t \mid d)$.
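The prediction rule can be sketched as below, assuming the topic posteriors of the individual have already been computed from eqn. 10. The function names and the nested-list layout `c[s][i]`, `var[s][i]` are assumptions of the example.

```python
import math

def normal_pdf(z, c, var):
    """Density of N(c, var) evaluated at z."""
    return math.exp(-(z - c) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict_rating(i, posterior, c, var, ratings=(1, 2, 3, 4, 5)):
    """Return t* = argmax_t sum_s N(t; c_si, var_si) Pr(y = s | d).

    posterior[s]: Pr(y = s | d) for the individual
    c[s][i], var[s][i]: topic-specific mean and variance for movie i
    """
    def score(t):
        return sum(posterior[s] * normal_pdf(t, c[s][i], var[s][i])
                   for s in range(len(posterior)))
    return max(ratings, key=score)
```

Since only the argmax over the five candidate ratings is needed, the Gaussian scores do not have to be normalized over t.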

3 Relationship with Other Latent 
Variable Models 

On a simplistic level, the Bernoulli-Gauss mixture model can be viewed
as a Gaussian mixture model integrated with a feature selection
procedure (the Bernoulli part of the model). On a deeper level,
however, there are subtleties that have to do with the positioning of
LBG with respect to MOU, LDA and pLSI, and specifically with the
manner in which LBG is a generative model like MOU and LDA, which we
describe below.

One difference is that LBG models the frequency of a code word (per
topic) as a Gaussian whereas MOU, LDA and pLSI model the probability
of appearance of code-words as a multinomial; in the limit the two are
really the same, as described next. Let
$\beta_{si} = \Pr(w = i \mid y = s)$ be the probability of drawing the
i'th code-word given the s'th topic. The number of appearances $m_i$
of the i'th code-word in a document is governed by a Binomial
distribution $m_i \sim \mathrm{Bin}(N, \beta_{si})$ where N is the
number of words in the document. By the De Moivre-Laplace theorem, as
$N \to \infty$,
$m_i \sim \mathcal{N}(N \beta_{si}, N \beta_{si} (1 - \beta_{si}))$.
Therefore, in practice, since the number of words N in a document is
typically large, the estimated means $c_{si}$ in the LBG model
correspond to $N \beta_{si}$ in the multinomial models.

The De-Moivre-Laplace argument above is also rele- 
vant for the justification of a Gaussian distribution 
as a model of word frequencies (or any other non- 
negative data). It implies that the probability of a 
negative value (in the generative sense) is vanishingly 
small. Successful attempts in using Gaussian mix- 
tures in non-negative numerical contexts, such as for 
collaborative filtering, include (Hofmann, 2003). In 
practice we have not observed any problematic issue 
with a Gaussian modeling and our experimental re- 
ports across a number of application domains (text 
analysis included) make that point as well. 
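The De Moivre-Laplace argument can be checked numerically. The sketch below compares the Binomial probabilities with the matching Normal density and evaluates the Normal mass below zero; the values N = 400 and beta = 0.1 are arbitrary choices for illustration and do not come from the experiments.

```python
import math

def binom_pmf(N, beta, m):
    """Pr(m appearances) under Bin(N, beta)."""
    return math.comb(N, m) * beta ** m * (1 - beta) ** (N - m)

def normal_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def normal_cdf(z, mean, var):
    return 0.5 * math.erfc((mean - z) / math.sqrt(2 * var))

N, beta = 400, 0.1                       # illustrative values only
mean, var = N * beta, N * beta * (1 - beta)

# largest pointwise gap between the Binomial pmf and the Normal density
max_gap = max(abs(binom_pmf(N, beta, m) - normal_pdf(m, mean, var))
              for m in range(N + 1))

# probability mass the Gaussian model assigns to negative counts
negative_mass = normal_cdf(0.0, mean, var)

print(max_gap, negative_mass)
```

For these values the gap is small relative to the peak density (about 0.066), and the mass below zero is negligible, in line with the argument above.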

It will be convenient, in this section, to represent a document
$d = (w_1, \ldots, w_N)$ by the (unordered) set of words
$w_i \in \{1, \ldots, n\}$ taking values from a vocabulary of n
code-words. We will begin the discussion with the comparison between
the MOU model and LBG. A document is generated by the MOU model by a
draw from a mixture of multinomials as follows. A topic is drawn by
tossing a k-faced die whose faces have probabilities
$\lambda_s = \Pr(y = s)$. A word is drawn by the toss of an n-faced
die, where we have k such dice, each representing a topic
$s = 1, \ldots, k$, with $\beta_{si}$ (as defined above) representing
the probability of the i'th face of the n-faced word-die associated
with topic s. The N words of a document are generated by (i) drawing a
topic s by tossing the k-faced topic-die, then repeating N times:
(ii) drawing a code-word by tossing the s'th word-die. In formal
language,

$$\Pr(d) = \sum_{s=1}^{k} \lambda_s \prod_{i=1}^{N} \beta_{s w_i}.$$

The model parameters $\lambda, \beta$ can be estimated by the EM
algorithm. The MOU model is simple and very popular in text analysis
circles. However, it has a number of drawbacks which have served as a
catalyst for introducing new algorithms, notably pLSI and LDA. The
notion that the appearance of all code-words is governed by the choice
of a single topic is too simplistic. First, there are code-words which
have a low probability of appearance in all topics, i.e., are
essentially topic-independent, yet are not stop-words. These words
undergo "starvation" in the MOU model as they almost never have a
chance to appear in a document generated by MOU. Second, polysemy (the
coexistence of multiple meanings for a code-word) is not modeled by
MOU. Consider a document d and a code-word w. In MOU the posterior
probability $\Pr(y = s \mid w, d)$ is independent of d:

$$\Pr(y = s \mid w, d) \propto \Pr(w \mid y = s)\, \Pr(y = s),$$



therefore it is not possible to convey multiple meanings 
for the code-word w as a function of other words in the 
document d. 

In the LBG model, the single topic assumption applies 
only to a selected set of code-words, whereas all other 
code-words are governed by a topic-unspecific distri- 
bution. The manner in which this principle plays in a 
generative model is described formally as follows: 

$$\Pr(d, x) = \sum_{s=1}^{k} \lambda_s \prod_{r=1}^{N} \Pr(w_r, x_r \mid y = s),$$



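where

$$\Pr(w_r = i, x_r \mid y = s) \propto \begin{cases} p_{si}\, c_{si} & \text{if } x_i = 1 \\ (1 - p_{si})\, c_i & \text{if } x_i = 0. \end{cases}$$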



In other words, the N words of a document d are generated through the
following steps:

• Draw a topic s by tossing the k-faced topic-die.

• Toss n coins with biases $p_{si}$, $i = 1, \ldots, n$, to draw the
indicator vector $x \in \{0, 1\}^n$.

• Create an n-faced word-die by setting $\tilde{\beta}_{si}$ to
$(1/N) c_{si}$ if $x_i = 1$ or to $(1/N) c_i$ if $x_i = 0$. The
probability $\beta_{si}$ of the i'th face of the word-die is
$(1/Z) \tilde{\beta}_{si}$ where Z is a normalization factor such
that $\sum_i \beta_{si} = 1$.

• Repeat N times: draw a word from the word-die constructed above.
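The generative steps listed above can be sketched with the standard library's random module. The function name, the argument layout and the use of `random.Random.choices` (which normalizes the supplied weights, so the 1/N and 1/Z factors cancel) are choices made for the example; the means are assumed to be non-negative with at least one positive entry.

```python
import random

def sample_document(lam, p, c, c_cross, N, rng):
    """Sample (topic, keyword indicator, words) following the steps above.

    lam[s]:     topic probabilities
    p[s][i]:    keyword probability of code word i in topic s
    c[s][i]:    topic-specific mean frequency of code word i
    c_cross[i]: topic-unspecific (cross-Gaussian) mean of code word i
    """
    k, n = len(lam), len(c_cross)
    # draw a topic by tossing the k-faced topic-die
    s = rng.choices(range(k), weights=lam)[0]
    # toss n coins with biases p_si to draw the indicator vector x
    x = [1 if rng.random() < p[s][i] else 0 for i in range(n)]
    # build the word-die: keyword faces use c_si, the others use c_i
    weights = [c[s][i] if x[i] == 1 else c_cross[i] for i in range(n)]
    # draw N words from the per-document word-die
    words = rng.choices(range(n), weights=weights, k=N)
    return s, x, words
```

A call such as `sample_document([0.5, 0.5], p, c, c_cross, 50, random.Random(0))` returns one topic, one indicator vector and 50 word indices; the word-die differs from document to document because x is redrawn each time.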

In other words, in the LBG model the word-die is generated per
document not only on the basis of the topic selection but also based
on the selection of keywords. With regard to polysemy, the posterior
probability $\Pr(y = s \mid w, d)$ now depends on d:

$$\Pr(y = s \mid w, d) \propto \Pr(w \mid y = s, d)\, \Pr(y = s),$$

unlike MOU. The LBG model therefore addresses the 
two main drawbacks of MOU: first being that the 
single-topic assumption does not apply to the entire 
document but only to selected keywords, and sec- 
ondly that the word generation process depends also 
on the document thereby allowing multiple meanings 
to words. Both of those "upgrades" make the under- 
lying model assumptions more realistic than MOU. 
Consequently, the LBG model can be considered as 
a natural extension of the MOU model where some of 
the limiting (and unrealistic) assumptions of MOU are 
relaxed. 

The LDA model addresses the single-topic assumption of MOU by allowing
multiple topics per document in the following manner. To generate the
N words of a document d, (i) a k-faced topic-die is generated by
sampling from a Dirichlet distribution with parameters
$\alpha_1, \ldots, \alpha_k$, then (ii) repeat N times: (a) sample a
topic s by tossing the topic-die, and (b) sample a word by tossing the
word-die $\beta_s$.

The parameters $\alpha, \beta$ of the LDA model are learned through a
Variational EM algorithm. Unlike MOU and LBG, in the LDA model the
topic is selected per word rather than once per document. This
approach solves the single-topic limitation of MOU and also the
polysemy issue, since the posterior $\Pr(y = s \mid w, d)$ depends on
d:

$$\Pr(y = s \mid w, d) \propto \Pr(w \mid y = s)\, \Pr(y = s \mid d).$$

However, there is a price to pay for the powerful generality of the
LDA model. First, the posteriors $\Pr(y = s \mid d)$ are
computationally intractable and instead are replaced by a mean-field
"surrogate" approximation or by sampling methods. Secondly, by design,
LDA requires a relatively large number of topics k (around 50), which
is fine in the world of text but is limiting in other data analysis
domains where the number of "topics" is known to be small (like
clustering applications).

In practice, LDA is often used for dimensionality reduction (using the
variational parameters $\gamma \in \mathbb{R}^k$ per document) as a
filter for SVM classification, and for supervised classification by
performing a separate LDA modeling per class. Despite the reservations
above, there are situations where the powerful generality of the LDA
model pays off; in the domain of text this happens when two topics are
very similar. In such cases, the modeling capacity of MOU and LBG is
too limited and cannot separate the two classes (see sec. 4 for
details).

The pLSI model represents the training data as a mixture of
multinomials and, like LDA, also allows for multiple topics per
document. The pLSI model (unlike MOU, LDA and LBG) is not generative,
i.e., there is no natural way to use the model to assign probability
to a novel document. Related to that, the number of parameters of the
model grows linearly with the training set, which creates a risk of
over-fitting. The pLSI model, therefore, is not a natural candidate
for classification tasks because a novel data instance cannot be
classified without essentially retraining on the entire dataset. We
refer the reader to (Blei et al., 2003) for a detailed comparison
between LDA and pLSI. We have included pLSI in our experiments
(sec. 4) as one can often obtain good performance if retraining is
allowed during classification of a novel document.



4 Experiments 

We conducted experiments with MAP classification, feature selection as
a filter for SVM and K-means, supervised classification fitting a
model per class, and collaborative filtering. Those experiments were
conducted on a number of datasets including 20NewsGroup^1,
100KMovieLens, and Spambase from the UCI ML repository.

MAP Unsupervised Classification: we begin with an unsupervised
classification experiment using the MAP output of our model (the
posteriors $\mu_s = \Pr(y = s \mid d)$). We randomly split the data
set into training and test subsets, generated by mixing records from
all the topics in the data set. Having stripped all the record
headers, a code-book is created, comprising all words in the data set
which are not stop-words. We trained a MAP classifier with our model
and evaluated the classification output by comparing the cluster label
of each record with its true label, as per the 20NewsGroup data set.

In order to measure the clustering performance, we use the zero-one
loss function, as follows. Given the i'th posting, let $s_i$ and
$\kappa_i$ be the obtained cluster label and the true label,
respectively. The accuracy (AC) is defined by
$AC = (1/m) \sum_{i=1}^{m} \delta(\kappa_i, map(s_i))$, where
$\delta(x, y)$ is an indicator function that equals one if $x = y$ and
zero otherwise, and $map(s_i)$ is the permutation mapping function
that maps each cluster label $s_i$ to its equivalent label from the
data set. The optimal mapping is obtained by the Kuhn-Munkres
algorithm (Lovasz & Plummer, 1986). We compared our results with those
of MOU, pLSI and LDA. For the latter, a clustering decision was made
by examining the $\phi$ variational parameters that are introduced for
each record. We repeated the experiments several times and the average
results are reported in Table 1. Note that pLSI retrains on the entire
data for each new test record, thus skewing the comparison; yet it is
interesting to note that LBG matched its performance nevertheless.
Note the large performance gap between LBG and MOU, underscoring the
significant upgrade to the MOU model. We conjecture that the
relatively low accuracy obtained by the LDA model is related to the
mean-field approximation and, as mentioned above, LDA is hardly ever
used for MAP applications, presumably for the same reasons.
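The accuracy measure can be illustrated with a short sketch. For brevity it finds the best label mapping by exhaustive search over permutations instead of the Kuhn-Munkres algorithm used in the experiments; this gives the same optimum but is only practical for a small number of clusters. The function name and the encoding of labels as integers 0..k-1 are assumptions of the example.

```python
from itertools import permutations

def clustering_accuracy(cluster_labels, true_labels, k):
    """AC = (1/m) * max over mappings of sum_i delta(kappa_i, map(s_i)).

    Brute-force substitute for Kuhn-Munkres: tries all k! one-to-one
    mappings from cluster labels to true labels.
    """
    best = 0
    for perm in permutations(range(k)):
        hits = sum(1 for s, kappa in zip(cluster_labels, true_labels)
                   if perm[s] == kappa)
        best = max(best, hits)
    return best / len(true_labels)
```

For clusters `[0, 0, 1, 1, 2, 2]` against true labels `[1, 1, 0, 0, 2, 1]`, the best mapping is 0→1, 1→0, 2→2, which matches five of the six records.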

Feature Selection: we compared the performance of our feature
selection procedure, as described in sec. 2.3, with the
dimensionality reduction offered by



^1 The 20NewsGroup data set, taken from the Usenet Newsgroup
Collection, consists of some 20,000 newsgroup postings, each one
categorized under a single topic, where each topic contains 1000
documents.

Figure 1: 20NewsGroup classification result on a binary classification problem, using SVM on the reduced set of features.
Graph (a) is misc.forsale vs. rec.sport.baseball. Graph (b) is comp.graphics vs. comp.os.ms_windows.misc.



Table 1: MAP classification performance comparison
for the 20NewsGroup data set.

    LBG     MOU     pLSI    LDA
    28%     15%     27%     12%



the variational parameters $\gamma$ of the LDA model. In our first
experiment, we selected a pair of classes from the 20NewsGroup dataset
and performed an SVM classification where the representation of the
data instances was either the selected coordinates given by LBG or the
reduced-dimension vector $\gamma$ provided by the LDA model. For
control purposes we also applied SVM on the raw representation
(without the filter). Fig. 1 shows the classification accuracy results
for two pairs of classes: a semantically close pair and a pair of
unrelated classes. Several experiments were conducted where the
proportion of the training data was varied from 5% to 30%. One can see
that the LBG filter produced accuracies comparable to raw data use
(slightly better for small training sets) with consistently better
performance than the LDA filter^2. Note that all approaches suffered
when applied to a semantically-related pair of classes.

In the second experiment, we performed an unsupervised classification
using K-means clustering on the filtered representations and without
the filter (the raw data). Results for both semantically-close and
semantically-unrelated pairs of classes are shown in Tables 2 and 3.
One can see that LDA can produce a superior accuracy when the two
classes are semantically close (comp.os.ms_windows.misc versus
comp.graphics). LBG, on the other hand, consistently outperformed LDA
for semantically-unrelated clusters.

^2 LDA at http://chasen.org/~daiti-m/dist/lda/

Collaborative Filtering: We used the 100KMovieLens
Collaborative-Filtering data, which consists of approximately 100,000
ratings for 1,682 movies by 943 viewers. As discussed in sec. 2.3, we
train our model using a fully-observed set of viewers. Then, for every
test viewer, we suppress a single, randomly-chosen movie rating. Our
task is to predict the rating, given all the other movies for which
that viewer has voted (known as the "Forced Prediction" protocol).
Following Hofmann (2003) and Breese et al. (1998), we use two
evaluation metrics which measure the distance of the estimated vote
$\hat{m}$ from the true vote m: the mean absolute error,
$MAE = \mathrm{avg}(|\hat{m} - m|)$, and the root mean squared error,
$RMSE = \sqrt{\mathrm{avg}((\hat{m} - m)^2)}$. We then compared our
method to Gaussian-pLSA proposed by (Hofmann, 2003) and to the
Baseline method that simply outputs the mean vote over the entire
training data for each movie. The results are displayed in Table 5.
Note that LDA and pLSI do not naturally accommodate the Forced
Prediction protocol as they do not measure word frequencies, and thus
were omitted from the comparison. One can see that LBG produced a
lower MAE compared to both Gaussian-pLSA and the Baseline method, and
a slightly lower error on the RMSE measure (compared to Baseline).
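The two evaluation metrics are straightforward to compute. A minimal sketch, assuming the root mean squared error is the square root of the average squared difference; the function names are illustrative.

```python
import math

def mae(predicted, actual):
    """Mean absolute error between predicted and true votes."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error between predicted and true votes."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
```

For predicted votes `[4, 3, 5]` and true votes `[5, 3, 3]`, the absolute errors are 1, 0 and 2, so the MAE is 1.0 and the RMSE is the square root of 5/3. RMSE penalizes the single two-star miss more heavily than MAE does, which is why the two measures can rank methods differently.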

Spam Filtering: The Spambase data set from the UCI Machine Learning
Repository consists of 4601 emails ("documents"), characterized by 54
attributes ("words") plus a class label ("spam" = positive / "ham" =
negative), where 39% of the emails are labeled as spam. We begin with
an unsupervised MAP estimation; Table 4 displays the performance of
LBG against MOU and pLSI (where with pLSI a retraining is required for
each test record). Note the performance gap between LBG and pLSI:



Table 2: K-means classification for semantically-close classes.

    Class pair                                        LBG       LDA       All
    comp.os.ms_windows.misc / comp.graphics           63.25%    77.5%     50.25%
    talk.politics.mideast / talk.politics.misc        72.75%    57.75%    58.625%
    rec.sport.baseball / rec.sport.hockey             52.875%   53%       50.375%
    talk.religion.misc / talk.religion.cristianity    57.625%   58.375%   50.5%



Table 3: K-means classification for semantically-unrelated classes.

    Class pair                                LBG       LDA       All
    comp.os.ms_windows.misc / rec.autos       93.75%    84.125%   95.375%
    comp.sys.mac.hardware / alt.atheism       97.25%    90.125%   96.875%
    alt.atheism / rec.motorcycles             94.125%   58.25%    88.875%
    rec.sport.baseball / misc.forsale         93%       89%       93%



Table 4: Unsupervised spam-filter classification performance
comparison.

    LBG     MOU     pLSI
    78%     60%     65%



Table 5: MovieLens Collaborative Filtering prediction results.

    Method           MAE      RMSE
    Baseline         0.905    1.1445
    Gaussian pLSA    1.884    2.1142
    LBG              0.776    1.1183



this we conjecture has to do with the plausibility of the 
single-topic assumption for spam filtering - words with 
high percentage occurrence serve as a natural discrim- 
inative indicator (for example, an email with repeated 
occurrences of the word "buy" is likely to be spam). 
The performance gap with MOU is attributed to the 
fact that the single-topic assumption is best applied on 
keywords rather than on all words of the document. 

(a) LBG

                       Predicted Negative   Predicted Positive
    Actual Negative    12.26%               0.26%
    Actual Positive    27.13%               60.34%

(b) pLSI

                       Predicted Negative   Predicted Positive
    Actual Negative    14.7%                4.7%
    Actual Positive    24.7%                55.91%

(d) MOU

                       Predicted Negative   Predicted Positive
    Actual Negative    27.87%               12.08%
    Actual Positive    11.52%               48.52%

Figure 2: Confusion tables for supervised spam-filter.

We then moved to a supervised setting, in which we 
used the class labels (spam/ham) in the training stage. 
We modeled each class separately using LBG, LDA, 



MOU and pLSI while fitting the optimal number of 
topics per model (see note at the end of sec. 2). Note 
that MOU, when k = 1, reduces to the SpamBayes 
algorithm. The confusion table of each method is dis- 
played in Fig. 2. Note the strikingly low false-positive 
(ham classified as spam) result for the LBG model, 
compared to other models. Future work might be 
directed to the development of an enhanced model, 
which will compensate for LBG's limited success with 
false-negatives. 